16-820: Advanced Computer Vision - HW6 - 2024
Introduction
In this homework we will work with the state-of-the-art foundation model for segmentation, SAM - Paper. Foundation models are large deep neural networks trained on massive datasets. SAM and other segmentation models use input prompts, such as a box around an object, to generate a mask of just that object. You will learn how to run off-the-shelf deep learning models and use them in your own work. We will use 2D segmentation masks, combined with camera geometry and depth, to arrive at dense 3D point clouds from 2D segmentations. The steps you will implement are:
- Run SAM on a single image from the dataset.
- Project the mask to 3D points in world coordinates.
- For all the unseen views, do:
- Project all points in world coordinates to the image frame.
- Automatically generate a new input to SAM and run SAM
- Project new mask to world coordinates and append to existing coordinates.
- Filter the point cloud using an off-the-shelf filtering approach.
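The steps above can be sketched as a loop (pseudocode only — the helper names below are placeholders; the actual function names and signatures are defined later in the notebook):

```python
# Sketch of the overall pipeline; all helpers are hypothetical placeholders.
mask0 = run_sam(images[0], user_box)                      # segment the seed view
points_world = mask_to_world(mask0, depths[0], K, c2w[0]) # lift mask to 3D via depth
for i in unseen_views:
    uv = world_to_image(points_world, K, c2w[i])          # project into view i
    box = box_from_points(uv)                             # auto-generate a SAM prompt
    mask_i = run_sam(images[i], box)
    points_world = append(points_world,
                          mask_to_world(mask_i, depths[i], K, c2w[i]))
points_world = filter_point_cloud(points_world)           # off-the-shelf filtering
```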
Homework introduction video here. To submit this homework to gradescope, please submit your code and the output of your code. E.g., for Q4 show the function you implemented and the visualizations created in the for loop. For Q1.1, give the K matrix you computed.
Instructor: Matthew O'Toole
OUT: November 21st, 2024
DUE: December 6th, 2024
TAs: Nikhil Keetha, Ayush Jain, Yuyao Shi
Definitions
A couple of definitions that will hopefully avoid confusion:
These are the existing frames:
- Camera frame/OpenCV Camera Frame: This is the reference frame for 3D points with respect to the camera. This is the camera frame discussed in class.
- Blender Camera Frame: This is the camera frame for 3D points used in Blender; the y and z axes point in the opposite direction w.r.t. the OpenCV camera frame.
- World Frame: This is the frame for 3D points with respect to the world origin. This frame differs by a rigid body transformation from any camera frame.
- Image Frame: The 2D points in the image, e.g., u, v in range [0, H] and [0, W].
If we refer to a prompt we mean the box around an object that is to be segmented, which is the input to SAM.
How to run this homework
We will use deep neural networks, which require a CUDA-enabled graphics card with at least 12GB of VRAM. The easiest way to get access to one is Google Colab; press the button below to open this homework in Google Colab. You'll find a pretty useful tutorial on how to use Google Colab here.
How to submit this homework
- First press "Run all" in the notebook and make sure that all plots come from your code, not the notebook's default plots.
- Then export as PDF; for the written version of the homework, simply submit this PDF and make sure to select all your results and code for each question.
- For the code, submit your iPython notebook file (.ipynb). We use this to check that your homework runs and gives the correct output. If we discover your code cannot reproduce the answers submitted in the written part, you will receive zero points for the question.
- No requirement on filenames.
FAQs
Hint: For Q3.1, remember the difference between the image frame (x horizontal) and the coordinates you might get from the mask. E.g., when you retrieve coordinates from the mask, they will not immediately align with the coordinates expected by the intrinsic matrix.
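To make the hint above concrete, here is a small sketch (with a made-up mask) of the pitfall: NumPy indexing returns (row, column) pairs, which correspond to (v, u), so the columns must be swapped before using them with the intrinsic matrix.

```python
import numpy as np

# Hypothetical 4x5 boolean mask with a single object pixel at row 1, column 3.
mask = np.zeros((4, 5), dtype=bool)
mask[1, 3] = True

rows, cols = np.nonzero(mask)          # NumPy gives (row, col) = (v, u)
uv = np.stack([cols, rows], axis=-1)   # reorder to (u, v) for camera geometry

print(uv)  # [[3 1]]
```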
You should not have to modify the viz_pts_3d function.
You should only call filter_points if thresh is not None, e.g. if thresh is not None: (call filter_points)
Use another Google account, or use Kaggle, if you can't connect to Google Colab.
The function mask2cam should contain an if statement to check if thresh is None. If thresh is None, do not run filter_points().
You should not have to change any of the given plotting functions; if you do, there is probably an error in your code.
In Q4, you should not have to change any code in the main for loop. Only the functions we have separated, i.e., cam2img, keep_dist, filter_for_box, prompt_points_to_box.
filter_for_box should not take in K as input; the input of the function is already in world coordinates.
You should not add a thresh argument to mask2cam in the for loop in Q4; this is by design.
Filtering in filter_for_box should happen in the world frame.
If you are getting unexpected results in Q4, you might be doing the correction for the Blender frame wrong. Remember, we have OpenCV Frame <-> Blender Frame <-> World frame. The 'transforms' given are Blender Frame <-> World frame, make corrections accordingly.
Another common source of error in Q4 is not getting the correct inverse of the transform. Hint: Google 'inverse of rigid body transforms'.
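The two hints above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the homework solution: the function names are made up, and the axis-flip convention is the one stated in the FAQ (Blender's camera y and z axes point opposite to OpenCV's).

```python
import numpy as np

# Flip the camera y and z axes to move between Blender and OpenCV conventions.
FLIP_YZ = np.diag([1.0, -1.0, -1.0, 1.0])

def blender_to_opencv(c2w_blender):
    # Re-express a Blender cam2world pose as an OpenCV cam2world pose.
    return c2w_blender @ FLIP_YZ

def invert_rigid(T):
    # Closed-form inverse of a rigid-body transform [R t; 0 1]:
    # the inverse is [R^T, -R^T t; 0 1] (cheaper and more stable than np.linalg.inv).
    R, t = T[:3, :3], T[:3, 3]
    Tinv = np.eye(4)
    Tinv[:3, :3] = R.T
    Tinv[:3, 3] = -R.T @ t
    return Tinv
```

Because FLIP_YZ is its own inverse, applying blender_to_opencv twice returns the original pose; invert_rigid agrees with np.linalg.inv on any valid rigid transform.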
To get a PDF for this homework, please follow these steps:
- Download the python notebook
- Open it in jupyter notebook
- Download as html
- Save as pdf
Environment Set-up with Google Colab
If running from Google Colab, set using_colab=True below and run the cell. In Colab, be sure to select 'GPU' under 'Edit'->'Notebook Settings'->'Hardware accelerator'.
using_colab = True

if using_colab:
    # install everything
    import torch
    import torchvision
    print("PyTorch version:", torch.__version__)
    print("Torchvision version:", torchvision.__version__)
    print("CUDA is available:", torch.cuda.is_available())
    import sys
    # note: `os` is part of the Python standard library and must not be pip-installed
    !{sys.executable} -m pip install opencv-python matplotlib
    !pip install open3d
    !{sys.executable} -m pip install 'git+https://github.com/facebookresearch/segment-anything.git'
    !mkdir ckpts
    !wget -P ckpts https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
    !pip install gdown
    !gdown 1K375xNjWAwZ7kmhjTuJccC3y6Ik5tN4q  # download dataset from gdrive
    !unzip images.zip
PyTorch version: 2.5.1+cu121 Torchvision version: 0.20.1+cu121 CUDA is available: True
[pip/wget/gdown output truncated: installed open3d-0.18.0 and segment_anything-1.0; saved 'ckpts/sam_vit_h_4b8939.pth' (2.4G); downloaded and unzipped images.zip into images/ (dataset/train/r_*.png, dataset/train/depth.npy, dataset/transforms_train.json, data_demo.gif, expected_output.png, cam_frames.png)]
from IPython.display import Image
img_size = 400
Image(filename="images/data_demo.gif", width=img_size, height=img_size)
<IPython.core.display.Image object>
Environment Set-up without Google Colab
If you're not running on Google Colab, use the prep_no_colab.sh script to install the right libraries, pull the model checkpoint and download the data. This script was tested on Ubuntu Linux only. After running the script your folder should look something like this:
├── images
│   └── dataset
│       ├── train
│       ├── test
│       ├── val
│       ├── transforms_train.json
│       ├── transforms_test.json
│       └── transforms_val.json
└── ckpts
    └── sam_vit_h_4b8939.pth
Our recommended method for loading iPython notebooks on your local computer is to use a Visual Studio Code plugin; here is a short tutorial on how to do that. Are you having issues setting up your system? Problems with CUDA versions? Use Colab instead, or ask a TA if you really want to use your own compute.
Set-up
Necessary imports and helper functions for displaying points, boxes, and masks.
%matplotlib inline
import numpy as np
import torch
import matplotlib.pyplot as plt
import cv2
import sys
import os
def show_mask(mask, ax, random_color=False):
    # This function is used to visualize the mask on the image in a matplotlib axis.
    # bool mask: (H, W). True for each pixel that belongs to the object.
    # ax: matplotlib axis
    # random_color: if True, use a random color for the mask. Otherwise, use blue.
    if random_color:
        color = np.concatenate([np.random.random(3), np.array([0.6])], axis=0)
    else:
        color = np.array([30/255, 144/255, 255/255, 0.6])
    h, w = mask.shape[-2:]
    mask_image = mask.reshape(h, w, 1) * color.reshape(1, 1, -1)
    ax.imshow(mask_image)
def show_box(box, ax):
    # This function is used to visualize the bounding box on the image in a matplotlib axis.
    # box: (4,) array. [x0, y0, x1, y1]
    # (x0, y0): top-left corner
    # (x1, y1): bottom-right corner
    # ax: matplotlib axis
    x0, y0 = box[0], box[1]
    w, h = box[2] - box[0], box[3] - box[1]
    ax.add_patch(plt.Rectangle((x0, y0), w, h, edgecolor='green', facecolor=(0, 0, 0, 0), lw=2))
Q1: Loading and Understanding the Dataset [2 pts]
The dataset you'll be working with is a synthetic dataset generated specifically for this homework. We used the free and open-source 3D graphics software tool Blender to render images from 100 different poses. You will have access to the following data:
- The intrinsic parameters, constant for all images. The dataset was rendered using a pinhole camera model without distortion; therefore, the intrinsics can be captured by the camera matrix K.
- The extrinsics, as a [100 x 4 x 4] array. Each [4 x 4] matrix gives the cam2world transformation.
- File paths to the 100 images, each of shape [800 x 800 x 3].
- 100 depth images, each of shape [800 x 800 x 3]. Each channel has the depth in meters, so 2 channels are redundant.
We will now load and visualize the dataset.
Q1.1 Compute the camera matrix K [2 pts]
import json
# dataset class provided to load extrinsics, intrinsics and image paths.
class Dataset:
    def __init__(self, json):
        self.json = json  # the json file containing the extrinsics, intrinsics and image paths.
        self.load_extrinsics()
        self.load_intrinsics()
        self.compute_intrinsics()

    def load_extrinsics(self):
        # This function loads the extrinsics parameters from the json file.
        with open(self.json) as f:
            self.data = json.load(f)
        self.frames = self.data['frames']  # 'frames' in the json contains the extrinsics and image path for each image.
        self.transforms = np.array([frame['transform_matrix'] for frame in self.frames])  # extrinsic matrix for each image, shape (N, 4, 4)
        self.file_paths = np.array([frame['file_path'] for frame in self.frames])  # path to each image, shape (N,)

    def load_intrinsics(self):
        # This function loads the intrinsics parameters from the json file.
        self.f_x = self.data['fl_x']  # focal length in x
        self.f_y = self.data['fl_y']  # focal length in y
        self.w = self.data['w']       # image width
        self.h = self.data['h']       # image height
        self.cx = self.data['cx']     # principal point in x
        self.cy = self.data['cy']     # principal point in y

    def compute_intrinsics(self):
        # self.K = None # K: the intrinsic matrix, shape (3, 3)
        # compute the K matrix from the intrinsic parameters loaded in load_intrinsics(): [2 pts]
        # TODO: YOUR CODE HERE
        self.K = np.array([[self.f_x, 0, self.cx],
                           [0, self.f_y, self.cy],
                           [0, 0, 1]])
dataset = Dataset('images/dataset/transforms_train.json') # load the dataset
np.set_printoptions(precision=3, suppress=True) # do NOT remove this line when you print matrices for grading
print('Shape of extrinsic matrices: {}'.format(dataset.transforms.shape)) # all extrinsic matrices, shape (N, 4, 4)
#TODO: print the intrinsic matrix and add to your gradescope submission.
print('K matrix {}'.format(dataset.K)) # The intrinsic matrix K you computed.
Shape of extrinsic matrices: (100, 4, 4) K matrix [[1111.111 0. 400. ] [ 0. 1111.111 400. ] [ 0. 0. 1. ]]
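As a sanity check on K, here is a minimal sketch of the pinhole projection the later questions rely on, using the values printed above (fx = fy = 1111.111, cx = cy = 400): a point on the optical axis should project to the principal point.

```python
import numpy as np

# Intrinsic matrix matching the dataset values printed above.
K = np.array([[1111.111, 0.0, 400.0],
              [0.0, 1111.111, 400.0],
              [0.0, 0.0, 1.0]])

X_cam = np.array([0.0, 0.0, 2.0])  # a point on the optical axis, 2 m in front of the camera
uvw = K @ X_cam                    # homogeneous image coordinates
u, v = uvw[:2] / uvw[2]            # perspective divide

print(u, v)  # -> 400.0 400.0, i.e. the principal point
```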
Visualize the dataset [0 pts]
Here we show the RGB and depth data that are part of the dataset. Reasoning about algorithm design is often easier when you understand the data.
image = cv2.imread(os.path.join('images/dataset',dataset.file_paths[0]))
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
plt.figure(figsize=(10,10))
plt.imshow(image)
plt.axis('on')
plt.show()
def depthmap_viz(depth, min_d=0.0, max_d=3.5):
    # depth: (H,W,3) - depth map; every channel contains the same depth values for that pixel, so 2 channels are redundant.
    # min_d: minimum depth value to visualize
    # max_d: maximum depth value to visualize
    depth = np.clip(depth, min_d, max_d)
    depth = (depth - min_d) / (max_d - min_d)  # normalize to [0, 1]
    plt.clf()
    # take a single channel so the colormap applies; vmin/vmax match the normalized range
    plt.imshow(depth[..., 0], cmap='magma', vmin=0.0, vmax=1.0)
depth_location = 'images/dataset/train/depth.npy' # location of ground truth depth maps.
depths = np.load(depth_location) # load the depth maps
depthmap_viz(depths[0]) # visualize the first depth map
plt.show() # show the plot
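The depth maps are what let us lift a 2D mask into 3D. As a hedged sketch (the pixel and depth below are made up; K matches the dataset's intrinsics), a pixel (u, v) with depth z is backprojected to the OpenCV camera frame via X = z · K⁻¹ [u, v, 1]ᵀ:

```python
import numpy as np

K = np.array([[1111.111, 0.0, 400.0],
              [0.0, 1111.111, 400.0],
              [0.0, 0.0, 1.0]])

u, v, z = 500.0, 300.0, 2.0  # hypothetical pixel and depth (meters)
X_cam = z * np.linalg.inv(K) @ np.array([u, v, 1.0])  # 3D point in the camera frame

# Round trip: projecting the 3D point back should recover the pixel.
uvw = K @ X_cam
assert np.allclose(uvw[:2] / uvw[2], [u, v])
```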
Loading SAM [0 pts]
The Segment Anything Model (SAM) produces high quality object masks from input prompts such as points or boxes, and it can be used to generate masks for all objects in an image. It has been trained on a dataset of 11 million images and 1.1 billion masks, and has strong zero-shot performance on a variety of segmentation tasks.
Here we load the SAM model and predictor. Running on CUDA with the default model is recommended for best results. Do not change any settings; altering them will complicate our grading, and you may not receive full credit.
import sys
from segment_anything import sam_model_registry, SamPredictor
sam_checkpoint = "ckpts/sam_vit_h_4b8939.pth" # the checkpoint loaded in the setup section.
model_type = "vit_h"
device = "cuda" # loading to GPU.
sam = sam_model_registry[model_type](checkpoint=sam_checkpoint)
sam.to(device=device)
predictor = SamPredictor(sam)
Q2: Designing our prompt [2 points]
SAM and other segmentation models use input prompts such as a box around an object to generate a mask of the object. From now on if we refer to a prompt we mean the box around an object that is to be segmented, which is the input to SAM.
In this homework we will start by acquiring a single user-generated prompt in one image: a box around the coffee mug. We will then use depth and camera geometry to propagate the mask to other frames. It is therefore important for the one user-specified prompt to be high-quality. Set the input_box parameter such that we get a high-quality segmentation.
Now we load in the first example image into the model.
#TODO: YOUR CODE HERE
# add this plot to gradescope submission.
# input_box = None # choose correct bounding box [2 pts]
input_box = np.array([350, 250, 450, 350])
plt.figure(figsize=(10, 10))
plt.imshow(image)
show_box(input_box, plt.gca())
plt.axis('off')
plt.show()
# Here we're running SAM on the image with the bounding box.
predictor.set_image(image) # loading the image to the predictor.
masks, _, _ = predictor.predict(
point_coords=None,
point_labels=None,
box=input_box[None, :],
multimask_output=False,
)
# Calling the predictor with the bounding box.
# You will not need to change any of the other arguments in this homework.
Visualizing the mask
mask = masks[0]
h, w = mask.shape[-2:]
mask_image = mask.reshape(h, w, 1)
plt.figure(figsize=(10, 10))
plt.imshow(image)
show_mask(mask, plt.gca())
show_box(input_box, plt.gca())
plt.axis('off')
plt.show()
Q3: Project to 3D [20 points]
As discussed before, the aim of this homework is to take the image mask in one image and propagate it to novel views. In this section we will use the mask generated in the previous question and project the pixels in the mask to 3D coordinates using depth and camera geometry. Remember that 3D points are projected to image coordinates using the P matrix, with P = K[R|t]; here we invert that mapping, using depth to resolve the scale ambiguity.
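As a minimal sketch of this inversion (with a hypothetical K and depth value, not the dataset's), back-projecting a pixel amounts to inverting the intrinsics to get a ray and scaling that ray by the depth:

```python
import numpy as np

# Hypothetical intrinsics and a single pixel with known depth.
K = np.array([[1000.0, 0.0, 400.0],
              [0.0, 1000.0, 400.0],
              [0.0, 0.0, 1.0]])
u, v, depth = 500.0, 300.0, 2.0   # pixel coordinates and depth in meters

pixel_h = np.array([u, v, 1.0])   # homogeneous pixel coordinates
ray = np.linalg.inv(K) @ pixel_h  # ray through the pixel, z = 1 after this
point_cam = ray * depth           # scale so the point sits at z = depth

# x = (u - cx) / f * d, y = (v - cy) / f * d, z = d
assert np.allclose(point_cam, [0.2, -0.2, 2.0])
```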
Q3.1: Image frame to camera frame [10 pts]
In this part of the question you will project points from the image frame (pixel coordinates) to a point cloud in the camera frame. For this you will only need the intrinsic matrix K and the depth. You will not need dataset.transforms in Q3.1.
def img2cam(points, K, depths=None):
    # project the points from image coordinates to camera coordinates [5 pts]
    # (1) Use the intrinsic matrix K to convert the points from image coordinates to rays in the camera frame.
    cam_3d = np.concatenate([points, np.ones((points.shape[0], 1))], axis=1).T
    cam_3d = np.linalg.inv(K) @ cam_3d
    # (2) Normalize the points to the plane z=1.
    cam_3d = cam_3d / cam_3d[2, :]
    # (3) Use depths to scale the points to the correct distance from the camera.
    if depths is not None:
        cam_3d = cam_3d * depths.reshape(1, -1)
    return cam_3d
def filter_points(coords, depths, thresh=2.55):
    # filter out points that are too far away in the first mask; this first mask will be very important! [2 pts]
    # return filtered coords and depths
    # don't make it too complicated; this should be a one-liner.
    keep = depths.squeeze() < thresh
    return coords[keep], depths.squeeze()[keep]
def mask2cam(mask, K, depths, thresh=None):
    # project mask points to camera frame [3 pts]
    # (1) get all coordinates where the mask is True; this should be N x 2 (row, col)
    coords = np.argwhere(mask)[:, :2]
    # (2) get the depth values for these coordinates, N x 1
    depth_values = depths[coords[:, 0], coords[:, 1], 1].reshape(-1, 1)
    # (3) if thresh is not None, call filter_points to drop points whose depth is above the threshold
    if thresh is not None:
        coords, depth_values = filter_points(coords, depth_values, thresh)
    # (4) call img2cam to convert the points to the camera frame using intrinsics and depth
    cam_3d = img2cam(coords, K, depth_values)
    # np.argwhere returns (row, col) = (v, u), so swap the first two rows to get (x, y, z) order
    return np.vstack((cam_3d[1], cam_3d[0], cam_3d[2]))
cam_pnts_3d = mask2cam(mask_image,dataset.K,depths[0],thresh=2.55)
def viz_pts_3d(pts, xrange=None, yrange=None, zrange=None, title=None):
    # visualize the 3D points
    fig = plt.figure(figsize=(10, 10))
    ax = fig.add_subplot(111, projection='3d')
    ax.scatter(pts[0, :], pts[1, :], pts[2, :], s=1)
    ax.set_xlabel('X [m]')
    ax.set_ylabel('Y [m]')
    ax.set_zlabel('Z [m]')
    if xrange is not None:
        ax.set_xlim(xrange)
    if yrange is not None:
        ax.set_ylim(yrange)
    if zrange is not None:
        ax.set_zlim(zrange)
    if title is not None:
        ax.set_title(title)
    plt.show()
np.save('cam_pnts_3d.npy',cam_pnts_3d)
#TODO: add this plot to gradescope submission
viz_pts_3d(cam_pnts_3d)
Q3.2: Camera frame to world frame [10 points]
We now project the points in the camera frame to the world frame. Keep in mind that the transforms provided map between the Blender camera frame (pictured below) and the world frame; you will need to take this difference into account when using the equations of projective geometry. For example, the intrinsic matrix K expects 3D points in the OpenCV camera frame. Summarizing, these are the existing frames:
- Camera frame/OpenCV Camera Frame: This is the reference frame for 3D points with respect to the camera, as seen in the class up to now.
- Blender Camera Frame: This is the camera frame for 3D points used in Blender; the y and z axes point in opposite directions w.r.t. the OpenCV frame.
- World Frame: This is the frame for 3D points with respect to the world origin.
- Image Frame: The 2D points in the image, e.g., u, v in range [0,H] and [0,W].
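The OpenCV-to-Blender camera-frame conversion amounts to negating the y and z axes, which can be sketched as follows (the example point is arbitrary):

```python
import numpy as np

# Diagonal flip matrix: x stays, y and z are negated.
CV_TO_BLENDER = np.diag([1.0, -1.0, -1.0])

pt_opencv = np.array([0.5, 0.2, 1.0])   # point in the OpenCV camera frame
pt_blender = CV_TO_BLENDER @ pt_opencv  # same point in the Blender camera frame
print(pt_blender)                       # → [0.5, -0.2, -1.0]

# The flip is its own inverse: applying it twice recovers the input.
assert np.allclose(CV_TO_BLENDER @ pt_blender, pt_opencv)
```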
Image(filename="images/cam_frames.png", width=img_size)
def cam2world(points, transform):
    # project camera coordinates to world coordinates [5 pts]
    # NOTE: transform is the transformation from the blender camera frame to the world frame.
    # Flip the y and z axes (OpenCV camera frame -> Blender camera frame)
    points = points * np.array([1, -1, -1]).reshape(3, 1)
    # Homogeneous coordinates
    homo_points = np.concatenate([points, np.ones((1, points.shape[1]))], axis=0)
    # Apply the transformation
    world_points = np.dot(transform, homo_points)
    return world_points
def world2cam(points, transform):
    # project world coordinates to camera coordinates [5 pts]
    # NOTE: do not use np.linalg.inv to compute the inverse of transform; we will award only partial credit.
    # There is an intuitive and elegant way to compute the inverse of transform.
    # NOTE: do not forget about blender coordinates!
    # Rotation and translation components
    R = transform[:3, :3]
    T = transform[:3, 3]
    # Inverse rotation and translation (R is orthonormal, so its inverse is its transpose)
    R_inv = R.T
    T_inv = -np.dot(R_inv, T)
    # Inverse transformation matrix
    inverse_transform = np.hstack((np.vstack((R_inv, [0, 0, 0])), np.hstack((T_inv, [1])).reshape(-1, 1)))
    # Apply the inverse transformation
    cam_points = np.dot(inverse_transform, points)
    # Normalize the homogeneous coordinates
    cam_points /= cam_points[3]
    # Flip y and z to match the OpenCV camera frame
    cam_points = np.vstack((cam_points[0], -cam_points[1], -cam_points[2]))
    return cam_points
def show_mask(mask, ax, random_color=False):
    # This function is used to visualize the mask on the image in a matplotlib axis.
    # bool mask: (H, W). True for each pixel that belongs to the object.
    # ax: matplotlib axis
    # random_color: if True, use a random color for the mask. Otherwise, use blue.
    if random_color:
        color = np.concatenate([np.random.random(3), np.array([0.6])], axis=0)
    else:
        color = np.array([30/255, 144/255, 255/255, 0.6])
    h, w = mask.shape[-2:]
    mask_image = mask.reshape(h, w, 1) * color.reshape(1, 1, -1)
    ax.imshow(mask_image)

def show_box(box, ax):
    # This function is used to visualize the bounding box on the image in a matplotlib axis.
    # box: (4,) array. [x0, y0, x1, y1]
    # ax: matplotlib axis
    x0, y0 = box[0], box[1]
    w, h = box[2] - box[0], box[3] - box[1]
    ax.add_patch(plt.Rectangle((x0, y0), w, h, edgecolor='green', facecolor=(0, 0, 0, 0), lw=2))
transform_0 = dataset.transforms[0]
world_pts = cam2world(cam_pnts_3d,transform_0)
np.save('world_pts.npy',world_pts)
#TODO: add this plot to gradescope submission
viz_pts_3d(world_pts)
Q4: Casting masks to new frames [10 pts]
Now we will look at new viewpoints and extract their point clouds. To do this we will loop through the new viewpoints one by one and run SAM on each new image. In each iteration of the loop we add the new 3D points to the previous 3D points, which are then used in the next iteration to create a new mask using projective geometry and depth.
A simple approach might be to deploy the projective geometry we have developed and find the bounding box at each iteration based on the coordinate range of the projected point cloud. That is, we project all 3D points to the novel view, giving an array of shape [N,2] with each row an x and y coordinate in the image frame. The bounding box could then simply be [min_x,min_y,max_x,max_y]. What would be the problem with an approach like this?
Occlusions and noise! Noise can come from pixels that weren't segmented correctly, and occlusions can cause entirely erroneous masks. You will implement several functions to improve the results:
(1) Filter out masks that have a confidence that's too low
(2) Filter out points that are too far away from anything else we've seen so far.
(3) Carefully choose the bounding box dimensions to prevent noise from having a big effect on the result.
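To see why a raw min/max bounding box is fragile, consider this toy sketch (the random points are illustrative, not from the dataset): a single spurious projected pixel inflates a naive min/max box dramatically, while trimming the sorted extremes first keeps the box tight.

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.normal(400, 15, 200)      # projected x coordinates clustered near x = 400
xs = np.append(xs, 790.0)          # one spurious pixel near the image border

naive_width = xs.max() - xs.min()  # blown up by the single outlier

trim = 10                          # drop the 10 smallest and largest values
xs_trimmed = np.sort(xs)[trim:-trim]
robust_width = xs_trimmed.max() - xs_trimmed.min()

assert robust_width < naive_width  # the trimmed box is much tighter
```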
def cam2img(points, K):
    # project camera coordinates to image coordinates [3 pts]
    # output should be pixel coordinates in the correct range with shape (2, N)
    # Apply the camera intrinsics
    img_points = np.dot(K, points)
    # Normalize by depth to get pixel coordinates
    img_points = img_points[:2, :] / img_points[2, :]
    return img_points
# (1) Filter out masks that have a confidence that's too low. [0 pts]
score_thresh = 0.85
def keep_score(score):
    return score > score_thresh
# (2) Filter out points that are too far away from anything else we've seen so far.
dist_thresh = 0.1 # decide on a good distance threshold [2 pts]
def keep_dist(new_pts, existing_pts):
    # compute the median of all points seen so far
    median_pts = np.median(existing_pts, axis=1, keepdims=True)
    # compute the distance between each new point and that median
    distance = np.linalg.norm(new_pts - median_pts, axis=0)
    # reject outliers that are too far away from all other points
    keep_mask = distance < dist_thresh
    return new_pts[:, keep_mask]
# (3) Carefully choose the bounding box dimensions to prevent noise from having a big effect on the result.
# filter out points n_std away from the mean of all points [3 pts]
def filter_for_box(world_points, transform, n_std=2):
    # compute mean and std of all points [2 pts]
    mean = np.mean(world_points[:3], axis=1, keepdims=True)
    std = np.std(world_points[:3], axis=1, keepdims=True)
    # filter out points that are more than n_std standard deviations from the mean [3 pts]
    keep_mask = np.all(np.abs(world_points[:3] - mean) < n_std * std, axis=0)
    filtered_world_points = world_points[:, keep_mask]
    # transform to cam frame [1 pt]
    cam_points = world2cam(filtered_world_points, transform)
    # you will need the intrinsics matrix K here; simply call dataset.K
    img_points = cam2img(cam_points, dataset.K)
    return img_points
# based on the filtered points, compute the bounding box [2 pts]
def prompt_points_to_box(prompt):
    # output: np.array([x0, y0, x1, y1])
    # (x0, y0): top-left corner
    # (x1, y1): bottom-right corner
    # Check that the prompt contains points
    if len(prompt[0]) == 0 or len(prompt[1]) == 0:
        print("Error: No points available to compute bounding box.")
        return None
    # Sort each coordinate, then trim the 10 extreme values on either side to remove outliers
    x, y = np.sort(prompt[0]), np.sort(prompt[1])
    x, y = x[10:-10], y[10:-10]
    # Check that points remain after trimming
    if len(x) == 0 or len(y) == 0:
        print("Error: After trimming, no points are available.")
        return None
    # compute the bounding box from the trimmed coordinates
    return np.array([np.min(x), np.min(y), np.max(x), np.max(y)])
import os
import copy
all_world_pts = copy.deepcopy(world_pts)
it = 1
# show_its = [1,4,10]
show_its = [1,4,10,50,75,99]
end_idx = -1 # set this to a different number, e.g. 10, for faster debugging
#TODO: add all plots generated to gradescope submission
for transform, file_path, depth in zip(dataset.transforms[1:end_idx], dataset.file_paths[1:end_idx], depths[1:end_idx]):
    # compute 3d points
    image = cv2.imread(os.path.join('images/dataset/', file_path))
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    prompt_points = filter_for_box(all_world_pts, transform)  # (3) Carefully choose the bounding box dimensions to prevent noise from having a big effect on the result.
    if it in show_its:
        plt.imshow(image)
        # scatter plot the prompt points
        plt.scatter(prompt_points[0, :], prompt_points[1, :], s=5)
        plt.title('It {}, prompt points.'.format(it))
        plt.show()
    predictor.set_image(image)
    prompt_box = prompt_points_to_box(prompt_points)  # (3) Choose the bounding box based on the filtered points.
    masks, scores, _ = predictor.predict(
        box=prompt_box,
        point_labels=[1],
        multimask_output=False,
    )
    if it in show_its:
        plt.figure(figsize=(10, 10))
        plt.imshow(image)
        show_mask(masks, plt.gca())
        show_box(prompt_box, plt.gca())
        plt.axis('off')
        plt.title('It {}, mask.'.format(it))
        plt.show()
    mask = masks.reshape((h, w, 1))
    cam_pnts_3d = mask2cam(mask, dataset.K, depth)
    tmp_world_pts = cam2world(cam_pnts_3d, transform)
    if keep_score(scores):  # (1) Filter out masks that have a score that's too low.
        tmp_world_pts = keep_dist(tmp_world_pts, all_world_pts)  # (2) Filter out points that are too far away from anything else we've seen so far.
        all_world_pts = np.hstack([all_world_pts, tmp_world_pts])
    if it in show_its:
        viz_pts_3d(all_world_pts, title='It {}, all points found so far.'.format(it), xrange=[-0.1, 0.1], yrange=[-0.45, -0.25], zrange=[0.06, 0.15])
    it += 1
Visualize all 3D points
# NOTE: this plot does NOT need to be added to the gradescope submission
viz_pts_3d(all_world_pts,xrange=[-0.1,0.1],yrange=[-0.45,-0.25],zrange=[0.06,0.15])
Q5: Statistical outlier removal [3 pts]
def filter_points(points, nb_neighbors=20, std_ratio=2.0):
    import open3d as o3d
    # filter points using open3d statistical outlier removal. [3 pts]
    # Create the point cloud
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(points.T)
    # Run statistical outlier removal
    pcd_filtered, ind = pcd.remove_statistical_outlier(nb_neighbors, std_ratio)
    return np.asarray(pcd_filtered.points).T
all_world_pts_filtered = filter_points(all_world_pts[:3,:])
#TODO: add this plot to gradescope submission
viz_pts_3d(all_world_pts_filtered,xrange=[-0.1,0.1],yrange=[-0.45,-0.25],zrange=[0.06,0.15])
Expected Output
Below is the output we achieved at the end of the homework; your implementation should be similar to receive full credit.
Image(filename="images/expected_output.png", width=img_size, height=img_size)
Q6 Extra credit: 3D Segmentation without Ground Truth Depth [10 pts Max]
So far we have provided you with ground truth depth from the 3D rendering toolbox. For a maximum of 10 extra points, can you achieve similar accuracy without using ground truth depth? You can use any toolbox/repository you like, as long as they infer dense depth maps: depth for every pixel in the image. Here is one approach we believe would be relatively straightforward:
- The dataset is formatted for Neural Radiance Fields (NeRFs). You should be able to run NeRF on this dataset with few modifications.
- Suggested NeRF pipeline: Torch-NGP. It runs fast using a cuda backend, but all the high-level features are implemented in Torch.
- Some necessary changes: (1) Modify the code to only use the training data, with no testing or validation (not provided in the dataset). You could also copy training data to test and validation folders and create the necessary json files. (2) Modify the code to return all train depth maps in a .npy array, in units of meters.
Any other methods are allowed and encouraged, as long as they infer dense depth maps: depth for every pixel in the image. Keep in mind the difference between z-depth as in Blender (the distance along the camera's principal axis) and depth as the Euclidean distance from the camera center.
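If your chosen network returns Euclidean ray length rather than z-depth, the two are related per pixel by the ray geometry. A minimal conversion sketch (the intrinsics here are hypothetical, not the dataset's):

```python
import numpy as np

# Hypothetical intrinsics for illustration.
K = np.array([[1000.0, 0.0, 400.0],
              [0.0, 1000.0, 400.0],
              [0.0, 0.0, 1.0]])

def ray_depth_to_z_depth(u, v, ray_depth, K):
    # The z component of the unit back-projected ray is the cosine
    # between the ray and the camera's principal axis.
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])
    return ray_depth * ray[2] / np.linalg.norm(ray)

# At the principal point the ray coincides with the principal axis,
# so the two depth definitions agree.
assert np.isclose(ray_depth_to_z_depth(400.0, 400.0, 2.0, K), 2.0)
# Off-axis pixels have z-depth strictly smaller than ray depth.
assert ray_depth_to_z_depth(700.0, 400.0, 2.0, K) < 2.0
```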